In genome-wide association studies (GWASs) of colorectal cancer, we have identified two genomic regions in
which pairs of tagging single nucleotide polymorphisms (tagSNPs) are associated with disease; these
comprise chromosomes 1q41 (rs6691170, rs6687758) and 12q13.13 (rs7136702, rs11169552). We investigated
these regions further, aiming to determine whether they contain more than one independent association
signal and/or to identify the SNPs most strongly associated with disease. Genotyping of additional sample
sets at the original tagSNPs showed that, for both regions, the two tagSNPs were unlikely to identify a
single haplotype on which the functional variation lay. Conversely, in each region, neither SNP of the pair
fully captured the association signal. We therefore undertook more detailed analyses, using imputation,
logistic regression, genealogical analysis using the GENECLUSTER program and haplotype analysis. In the
1q41 region, the SNP rs11118883 emerged as a strong candidate based on all these analyses, sufficient to
account for the signals at both rs6691170 and rs6687758. rs11118883 lies within a region with strong evidence
of transcriptional regulatory activity and has been associated with expression of PDGFRB mRNA. For
12q13.13, a complex situation was found: SNP rs7972465 showed stronger association than either
rs11169552 or rs7136702, and GENECLUSTER found no good evidence for a two-SNP model. However,
logistic regression and haplotype analyses supported a two-SNP model, in which a signal at the SNP rs706793 was
added to that at rs11169552. Post-GWAS fine-mapping studies are challenging, but the use of multiple tools
can assist in identifying candidate functional variants in at least some cases.



INTRODUCTION 

Using genome-wide association studies (GWASs), we have
identified 14 regions that contain tagging single nucleotide
polymorphisms (tagSNPs) associated with the risk of colorectal
cancer (CRC) (1). Within three of these regions (chromosomes
14q22.2, 15q13.3 and 20p12.3), we have shown that there exist
two SNPs that are independently associated with disease (2). In
two further regions (chromosomes 1q41 and 12q13.13), there
are two SNPs associated with CRC risk, but from the original
GWA analysis, it was unclear whether these represented
independent signals of association (1). At 1q41, these SNPs
are rs6691170 (chr1: 220,112,069 bases) and rs6687758 (chr1:
220,231,571); they are in modest pairwise linkage
disequilibrium (LD) (r² = 0.22; D′ = 0.71). At 12q13.13, the two
SNPs are rs7136702 (chr12: 49,166,483) and rs11169552
(chr12: 49,441,930); these SNPs too are moderately correlated
(r² = 0.11, D′ = 0.76). Our previous analyses had not resolved
the issue of whether there could be more than one independent
CRC SNP in either of these regions (1).
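
The two LD statistics quoted above, r² and D′, can be computed from haplotype and allele frequencies. The following is a minimal Python sketch under the assumption that phased two-SNP haplotype frequencies are available; the function name `ld_stats` and the example frequencies are illustrative only and are not taken from the study data.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (r2, d_prime) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    # Raw disequilibrium coefficient: observed minus expected haplotype frequency.
    d = p_ab - p_a * p_b
    # r2 normalises D squared by the product of the four allele frequencies.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # D' normalises |D| by its maximum attainable value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r2, d_prime


# Hypothetical frequencies chosen only to illustrate a high-D', low-r2 pair.
r2, d_prime = ld_stats(p_ab=0.15, p_a=0.40, p_b=0.20)
print(f"r2 = {r2:.2f}, D' = {d_prime:.2f}")
```

A pair of SNPs can therefore show a fairly high D′ alongside a low r², as at both regions studied here, when the two minor allele frequencies differ.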

One of the aims of GWASs is the discovery of functional/ 
causal variants, the effects of which are manifest in the 
tagSNP associations. It is, however, very challenging to 
proceed from a tagSNP association to identifying functional 
variants, and relatively few such studies have been reported to 
date. One reason for this is that the correlation matrix 
between tagSNP(s) and functional variant(s) at any locus may 
be complex. If two association signals occur at tagSNPs at 
the same locus, the possible causes include the following: 

(i) the associated tagSNPs are in LD; 

(ii) there are two independent functional sites, each in LD 
with one tagSNP; 

(iii) there are two functional sites, but there is true epistasis; 

(iv) there is a single functional site on a haplotype defined by 
the two tagSNPs; 

(v) there are > 2 independent functional sites in LD with one 
or more tagSNPs; 

(vi) there is a mixture of the above possibilities. 

It can be extremely hard to distinguish among these
possibilities, and our inability to deconvolute association signals
may help explain why so much of the heritability of
complex diseases is unexplained by GWASs to date (3).
Despite these problems, functional variant discovery may be 
aided by a deeper examination of genetic variation in the 
LD blocks in which the tagSNPs reside. Such discovery is 
likely to benefit from efforts such as the 1000 Genomes 
Project, where a comprehensive discovery of novel variants 
has been carried out in several populations. 

In this study, we had three aims. First, we wished to
investigate as fully as possible whether there was likely to be one or
more than one functional variant underlying the association
signals at 1q41 and 12q13.13 in CRCs. Secondly, we wanted
to investigate other tagSNPs in these regions for evidence of
further, independent association signals. Thirdly, we wished
to use imputation and functional annotation to refine the
most likely location of the 'disease-causing' variant in both
the 1q41 and 12q13.13 regions.



RESULTS 

The 1q41 region

We genotyped a total of 48 174 samples (22 832 cases and 25
892 controls) from 17 sample sets at rs6691170 and
rs6687758. This analysis included five replication case/control
cohorts that were not previously reported (1) for
these SNPs: Kentucky; Prague; EPICOLON; Leiden; and
Australia. After meta-analysis in STATA, both rs6691170
and rs6687758 were, as expected, significantly associated
with CRC risk (Table 1), with no evidence of heterogeneity
among studies. Incorporating both SNPs into an unconditional
logistic regression model showed that neither of the pair of
SNPs fully captured the association signal in the region
[odds ratio (OR) = 1.06, P = 1.06 × 10⁻⁴ for rs6691170
and OR = 1.07, P = 2.48 × 10⁻⁴ for rs6687758]. We used
PLINK to examine the possibility that the two tagSNPs
indicated a single high-risk haplotype on which an unknown
functional SNP was present (that is, all the functional risk alleles
resided on a haplotype composed solely of one of the four
possible pairs of tagSNP alleles). However, the association signal
was not simply present on the high-risk haplotype TG (for
rs6691170|rs6687758). Instead, the risks for the 'compound'
(high-low or low-high) haplotypes, GG and TA, were
Table 1. Summary of genotyping and association results at the original four tagSNPs on 1q41 and 12q13.13 in the extended data sets

rs6691170; chr1: 220,112,069; z = 6.87; P = 6.42 × 10⁻¹²; OR = 1.10, 95% CI = 1.07-1.12; Phet = 0.39; I² = 5.9%; allele 1 = T; allele 2 = G

Series             Ca11  Ca12  Ca22  Co11  Co12  Co22   Ca1   Ca2   Co1   Co2  MAFca  MAFco     OR  Ntot   Nca   Nco
UK1/CORGI           130   435   355   100   429   393   695  1145   629  1215   0.38   0.34  1.172  1215   920   922
Scotland1/COGS      134   463   379   130   433   435   731  1221   693  1303   0.37   0.35  1.126  1303   976   998
UK2/NSCCG           398  1395  1058   355  1304  1159  2191  3511  2014  3622   0.38   0.36  1.122  3622  2851  2818
Scotland2/SOCCS     248   941   817   239   967   851  1437  2575  1445  2669   0.36   0.35  1.031  2669  2006  2057
VQ58                277   833   688   359  1234  1096  1387  2209  1952  3426   0.39   0.36  1.102  3426  1798  2689
CFR                 149   581   447   155   436   399   879  1475   746  1234   0.37   0.38  0.986  1234  1177   990
UK3/NSCCG           406  1448  1137   367  1251  1198  2260  3722  1985  3647   0.38   0.35  1.116  3647  2991  2816
Scotland3/SOCCS     103   376   326   117   447   363   582  1028   681  1173   0.36   0.37  0.975  1173   805   927
UK4/CORGI2BCD        70   212   213   129   473   445   352   638   731  1363   0.36   0.35  1.029  1363   495  1047
Cambridge           324  1068   805   280  1013   890  1716  2678  1573  2793   0.39   0.36  1.138  2793  2197  2183
COIN/NBS            300  1054   797   326  1170  1005  1654  2648  1822  3180   0.38   0.36  1.090  3180  2151  2501
Helsinki            143   435   351   105   372   340   721  1137   582  1052   0.39   0.36  1.146  1052   929   817
Prague              147   424   363    82   317   252   718  1150   481   821   0.38   0.37  1.066   821   934   651
Kentucky            156   466   388   244   709   630   778  1242  1197  1969   0.39   0.38  1.030  1969  1010  1583
EPICOLON            193   613   520   184   632   578   999  1653  1000  1788   0.38   0.36  1.081  1788  1326  1394
Australia            64   223   129    58   212   168   351   481   328   548   0.42   0.37  1.219   548   416   438
Leiden              141   404   310    92   291   304   686  1024   475   899   0.40   0.35  1.268   899   855   687

rs6687758; chr1: 220,231,571; z = 5.64; P = 1.70 × 10⁻⁸; OR = 1.09, 95% CI = 1.06-1.13; Phet = 0.28; I² = 14.8%; allele 1 = G; allele 2 = A

Series             Ca11  Ca12  Ca22  Co11  Co12  Co22   Ca1   Ca2   Co1   Co2  MAFca  MAFco     OR  Ntot   Nca   Nco
UK1/CORGI            37   312   568    32   299   598   386  1448   363  1495   0.21   0.20  1.098  1495   917   929
Scotland1/COGS       63   308   606    34   325   642   434  1520   393  1609   0.22   0.20  1.169  1609   977  1001
UK2/NSCCG           121   985  1746    98   898  1822  1227  4477  1094  4542   0.22   0.19  1.138  4542  2852  2818
Scotland2/SOCCS      77   694  1235    74   639  1344   848  3164   787  3327   0.21   0.19  1.133  3327  2006  2057
VQ58                 86   605  1106   113   832  1742   777  2817  1058  4316   0.22   0.20  1.125  4316  1797  2687
CFR                  50   364   756    51   327   607   464  1876   429  1541   0.20   0.22  0.888  1541  1170   985
UK3/NSCCG           122   947  1920   115   850  1861  1191  4787  1080  4572   0.20   0.19  1.053  4572  2989  2826
Scotland3/SOCCS      48   263   519    51   315   566   359  1301   417  1447   0.22   0.22  0.958  1447   830   932
UK4/CORGI2BCD        24   158   306    45   309   669   206   770   399  1647   0.21   0.20  1.104  1647   488  1023
Cambridge            89   755  1366    76   664  1444   933  3487   816  3552   0.21   0.19  1.165  3552  2210  2184
COIN/NBS            102   701  1330    89   770  1642   905  3361   948  4054   0.21   0.19  1.151  4054  2133  2501
Helsinki             67   385   476    49   317   437   519  1337   415  1191   0.28   0.26  1.114  1191   928   803
Prague               47   335   552    33   230   388   429  1439   296  1006   0.23   0.23  1.013  1006   934   651
Kentucky             41   312   657    57   509  1017   394  1626   623  2543   0.20   0.20  0.989  2543  1010  1583
EPICOLON             57   429   840    46   442   906   543  2109   534  2254   0.20   0.19  1.087  2254  1326  1394
Australia            25   151   264    17   152   269   201   679   186   690   0.23   0.21  1.098   690   440   438
Leiden               45   284   521    28   212   448   374  1326   268  1108   0.22   0.19  1.166  1108   850   688

rs7136702; chr12: 49,166,483; z = 6.69; P = 2.23 × 10⁻¹¹; OR = 1.10, 95% CI = 1.07-1.12; Phet = 0.51; I² = 0.0%; allele 1 = T; allele 2 = C

Series             Ca11  Ca12  Ca22  Co11  Co12  Co22   Ca1   Ca2   Co1   Co2  MAFca  MAFco     OR  Ntot   Nca   Nco
UK1/CORGI           131   433   357   113   430   386   695  1147   656  1202   0.38   0.35  1.110  1202   921   929
Scotland1/COGS      146   443   388   126   444   431   735  1219   696  1306   0.38   0.35  1.131  1306   977  1001
UK2/NSCCG           380  1331  1140   329  1306  1183  2091  3611  1964  3672   0.37   0.35  1.083  3672  2851  2818
Scotland2/SOCCS     276   975   755   275   935   847  1527  2485  1485  2629   0.38   0.36  1.088  2629  2006  2057
VQ58                237   869   694   295  1290  1102  1343  2257  1880  3494   0.37   0.35  1.106  3494  1800  2687
CFR                 155   604   427   103   444   450   914  1458   650  1344   0.39   0.33  1.296  1344  1186   997
UK3/NSCCG           402  1388  1190   359  1283  1180  2192  3768  2001  3643   0.37   0.35  1.059  3643  2980  2822
Scotland3/SOCCS     118   310   270   122   401   356   546   850   645  1113   0.39   0.37  1.108  1113   698   879
UK4/CORGI2BCD        81   215   190   151   466   444   377   595   768  1354   0.39   0.36  1.117  1354   486  1061
Cambridge           332   955   903   261  1015   906  1619  2761  1537  2827   0.37   0.35  1.079  2827  2190  2182
COIN/NBS            287   893   844   321  1121  1059  1467  2581  1763  3239   0.36   0.35  1.044  3239  2024  2501
Helsinki            103   389   436    72   334   414   595  1261   478  1162   0.32   0.29  1.147  1162   928   820
Prague               85   419   430    57   291   303   589  1279   405   897   0.32   0.31  1.020   897   934   651
Kentucky            140   478   392   215   750   618   758  1262  1180  1986   0.38   0.37  1.011  1986  1010  1583
EPICOLON            198   642   486   187   623   584  1038  1614   997  1791   0.39   0.36  1.155  1791  1326  1394
Leiden              115   388   341    92   269   321   618  1070   453   911   0.37   0.33  1.162   911   844   682

rs11169552; chr12: 49,441,930; z = 6.88; P = 5.99 × 10⁻¹²; OR = 0.90, 95% CI = 0.88-0.93; Phet = 0.50; I² = 0.0%; allele 1 = T; allele 2 = C

Series             Ca11  Ca12  Ca22  Co11  Co12  Co22   Ca1   Ca2   Co1   Co2  MAFca  MAFco     OR  Ntot   Nca   Nco
UK1/CORGI            56   328   537    67   350   512   440  1402   484  1374   0.24   0.26  0.891  1374   921   929
Scotland1/COGS       60   369   544    76   406   519   489  1457   558  1444   0.25   0.28  0.869  1444   973  1001
UK2/NSCCG           209  1062  1580   199  1124  1494  1480  4222  1522  4112   0.26   0.27  0.947  4112  2851  2817
Scotland2/SOCCS     111   808  1087   152   821  1084  1030  2982  1125  2989   0.26   0.27  0.918  2989  2006  2057
VQ58                109   665  1026   201  1046  1442   883  2717  1448  3930   0.25   0.27  0.882  3930  1800  2689
CFR                  72   450   663    73   408   516   594  1776   554  1440   0.25   0.28  0.869  1440  1185   997
UK3/NSCCG           167  1179  1625   214  1142  1463  1513  4429  1570  4068   0.25   0.28  0.885  4068  2971  2819
Scotland3/SOCCS      14   127   176    80   321   490   155   479   481  1301   0.24   0.27  0.875  1301   317   891
UK4/CORGI2BCD        34   175   277    80   395   554   243   729   555  1503   0.25   0.27  0.903  1503   486  1029
Cambridge           155   824  1241   163   853  1172  1134  3306  1179  3197   0.26   0.27  0.930  3197  2220  2188
COIN/NBS            135   818  1107   189   973  1338  1088  3032  1351  3649   0.26   0.27  0.969  3649  2060  2500
Helsinki            103   407   401   153   356   303   613  1209   662   962   0.34   0.41  0.737   962   911   812
Prague               51   375   508    38   273   340   477  1391   349   953   0.26   0.27  0.936   953   934   651
Kentucky             58   377   575    93   665   825   493  1527   851  2315   0.24   0.27  0.878  2315  1010  1583
EPICOLON             56   453   817    75   471   848   565  2087   621  2167   0.21   0.22  0.945  2167  1326  1394
Leiden               53   304   492    57   251   378   410  1288   365  1007   0.24   0.27  0.878  1007   849   686

Ca, cases; Co, controls; 11, rare homozygote; 12, heterozygote; 22, common homozygote; 1, minor allele; 2, major allele. Allele 1 is risk allele for rs6691170, rs6687758 and
rs7136702; allele 2 is risk allele for rs11169552. MAF, minor allele frequency; OR, odds ratio.
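
The per-series allele counts and odds ratios in Table 1 are related by simple arithmetic, and per-series estimates can be pooled by inverse-variance weighting. The sketch below is a minimal Python illustration, assuming unadjusted allelic odds ratios and a fixed-effect model; the function names are hypothetical, and the published summary statistics were produced in STATA, so this sketch is not expected to reproduce them exactly.

```python
import math


def allelic_or(ca1, ca2, co1, co2):
    """Per-allele odds ratio and standard error of log(OR) from allele counts.

    ca1/ca2: minor/major allele counts in cases; co1/co2: the same in controls.
    """
    odds_ratio = (ca1 * co2) / (ca2 * co1)
    se_log_or = math.sqrt(1 / ca1 + 1 / ca2 + 1 / co1 + 1 / co2)
    return odds_ratio, se_log_or


def fixed_effect_meta(studies):
    """Inverse-variance fixed-effect pooling of (OR, SE of log OR) pairs."""
    betas = [math.log(odds_ratio) for odds_ratio, _ in studies]
    weights = [1 / se ** 2 for _, se in studies]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1 / w_sum)
    z = beta / se
    # Two-sided P-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    # Cochran's Q and the I2 heterogeneity statistic (as a percentage).
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {"or": math.exp(beta), "se": se, "z": z, "p": p, "q": q, "i2": i2}


# Allele counts for rs6691170 in two series of Table 1 (UK1/CORGI, Scotland1/COGS).
uk1 = allelic_or(695, 1145, 629, 1215)
scot1 = allelic_or(731, 1221, 693, 1303)
print(round(uk1[0], 3), round(scot1[0], 3))
print(fixed_effect_meta([uk1, scot1]))
```

Applied to the UK1/CORGI and Scotland1/COGS rows for rs6691170, `allelic_or` returns odds ratios that round to the tabulated values of 1.172 and 1.126.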
Figure 1. Individual SNP associations in the 1q41 region. Association testing was performed in SNPtest using typed and imputed genotypes from the three
GWAS series (UK1, Scotland1 and VQ58) and displayed using SNAP (http://www.broadinstitute.org/mpg/snap/). The X-axis shows position on chromosome
1 (hg18, kb) and the Y-axis, -log10(P) from the per allele association test. The most strongly associated SNP, rs11118883, is shown as a large diamond, and the colours of
other data points reflect the LD between that SNP and rs11118883. The smaller diamond points indicate genotyped SNPs and the triangles indicate imputed
SNPs. The blue line represents recombination rates.



Table 2. Two-SNP logistic regression analysis showing best signal in the 1q41 region in comparison with the originally reported SNPs

SNPs        Positions (bases)  LD (r², D′)  Risk allele (freq cases, freq controls)  No. cases, no. controls  OR    95% CI     Z     P-value      AIC
rs6691170   220,112,069        0.15, 0.65   T (0.38, 0.36)                           3272, 4572               1.09  1.01-1.17  2.95  0.0032       10486
rs6687758   220,231,571                     G (0.22, 0.20)                                                    1.08  0.98-1.18  1.61  0.108
rs11118883  220,127,645        0.92, 1.00   A (0.32, 0.29)                           3206, 4452               2.32  1.49-3.62  3.71  2.07 × 10⁻⁴  10475
rs12726661  220,134,411                     A (0.68, 0.71)                                                    0.49  0.32-0.76  3.16  1.58 × 10⁻³

LD is shown for the pair of SNPs being tested. OR, odds ratio; AIC, Akaike information criterion (AIC = -2 × log-likelihood + 2 × (number of parameters)). Note
the lower AIC, showing a better model fit, for the test of rs11118883 + rs12726661 compared with rs6691170 + rs6687758. Individual AICs for these four SNPs
were, respectively, 10487, 10489, 10484 and 10487.
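
The AIC comparison described in the footnote to Table 2 can be illustrated with a short Python sketch. The AIC values below are those reported in Table 2; the helper names and the log-likelihood passed to `aic` in the last line are hypothetical and serve only to show the formula.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def rank_by_aic(aics):
    """Sort models by AIC and report each model's difference from the best (lowest) AIC."""
    best = min(aics.values())
    return sorted(((name, value - best) for name, value in aics.items()),
                  key=lambda item: item[1])


# AIC values as reported in Table 2 and its footnote.
reported = {
    "rs6691170 + rs6687758": 10486,
    "rs11118883 + rs12726661": 10475,
    "rs6691170": 10487,
    "rs6687758": 10489,
    "rs11118883": 10484,
    "rs12726661": 10487,
}
for name, delta in rank_by_aic(reported):
    print(f"{name}: delta AIC = {delta}")

# Hypothetical log-likelihood, chosen only to demonstrate the formula.
print(aic(log_likelihood=-5234.5, n_params=3))
```

On these values, the pair of imputed SNPs has the lowest AIC, and rs11118883 alone (10484) ranks ahead of the original tagSNP pair (10486).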



greater than those for the low-low haplotype (GA),
inconsistent with a functional SNP being in complete LD with a
haplotype indicated by the pair of tagSNPs (Supplementary
Material, Table S1). We also tested for evidence of epistasis
between rs6691170 and rs6687758 using case-control logistic
regression analysis, incorporating interaction between SNPs as
a variable, but no evidence of deviation from log-additive SNP
effects was found (P = 0.292).

Having failed to find evidence for the simplest situations
(namely, that one of each tagSNP pair captured the great
majority of the association signal, or that the tagSNPs essentially
acted as simple two-locus tags for the functional variants in
each region), we attempted to deconvolute the 1q41 signal
by association testing of imputed SNPs in the region. The
three GWAS sample sets, UK1, Scotland1 and VQ58, were
imputed to the combined 1000 Genomes and HapMap3
reference set. A total of 630 SNPs in the 220-221 Mb region on
chromosome 1q41 were successfully imputed from 76
genotyped SNPs. The strongest association signal (Fig. 1,
Supplementary Material, Table S2), as measured by association test
P-value, was at rs11118883 (chr1: 220,127,645), an imputed
SNP in moderate LD with rs6691170 (r² = 0.40, D′ = 0.74)
and rs6687758 (r² = 0.31, D′ = 0.77).

We then used reverse stepwise logistic regression analysis
to determine whether rs6691170 and rs6687758, or other
combinations of SNPs, best accounted for the association between
CRC and 1q41 variation. Using a final significance threshold
of P = 0.01, we found that two imputed SNPs, rs11118883
and rs12726661, were most strongly associated with the
CRC risk (Table 2, Supplementary Material, Table S2). By
comparison, a joint analysis of rs6687758 and rs6691170 in
the same three GWAS data sets gave much weaker evidence
of association, as assessed using the Akaike Information
Criterion (AIC). Indeed, a model incorporating rs11118883
alone (although not one with rs12726661 alone) provided a
better fit than a model incorporating both rs6687758 and
rs6691170; haplotype-based association analysis supported
these findings (data not shown).

We were surprised to note that in a single-SNP analysis the
direction of effect for rs12726661 was reversed (the minor
allele was associated with disease risk) compared with that in
the two-SNP analysis. We determined that rs11118883 and
rs12726661 were in strong LD (r² = 0.98, D′ = 1.00) in our
samples, consistent with data from the 1000 Genomes Project
and HapMap3 that had been used for imputation. Examination
of the genotype distribution in our data set showed that
deviation from perfect LD between the SNPs resulted from two
sets of individuals: (i) 50 homozygous for the major allele at
rs12726661 and heterozygous at rs11118883; and (ii) 15
heterozygous at rs12726661 and homozygous for the minor
allele at rs11118883. Specifically, 28/50 in category (i) were
cases and 4/11 in category (ii) were cases. For these 65
individuals, the risk of CRC was significantly greater than that
of individuals with the other genotypes at rs12726661 and
rs11118883 (OR = 2.10, P = 0.003, χ² test). A potential
explanation for our apparently paradoxical findings is that
there exists another allele, almost certainly relatively
rare, that is associated with the minor allele of rs12726661
(but not with rs11118883), and that is protective against the
CRC risk.

We then analysed our UK1, Scotland1 and VQ58
individuals using GENECLUSTER with the original GWAS SNP
genotypes in the rs6691170/rs6687758 region as inputs. There
was no evidence to favour an underlying two-locus model
over a one-SNP model (Fig. 2). The predicted most strongly
associated single SNP was rs11577023, a SNP that is in
very strong LD with rs11118883 (r² = 0.93, D′ = 1.0) in
our data.

We genotyped rs11118883 directly in a set of 84 UK control
samples and found complete concordance with the imputed
genotypes. rs11118883 (chr1:220,127,645) lies in a gene desert,
within a region of LD that extends approximately from 220.0
to 220.3 Mb. The nearest gene, ~150 kb towards the
centromere, is the MAP kinase regulator dual-specificity phosphatase
10 (DUSP10). DUSP10 inactivates p38 and also the Jun
N-terminal kinase that phosphorylates c-Jun, which is believed
to play a role in CRC pathogenesis. rs11118883 itself lies
upstream of DUSP10 within a region with strong evidence of
transcriptional regulatory activity (http://genome.ucsc.edu).
Using 1000 Genomes data, we found that rs11118883 is in
strong LD (r² > 0.7) with at least six SNPs (rs12738322,
rs12726661, rs4129271, rs11577023, rs10746414 and
rs12137702). Of these, rs10746414 and rs12137702 are also
close to regions with potential effects on transcription.

The 12q13.13 region

Analysis of the 12q13.13 region proceeded in parallel with
that of the 1q41 region using essentially the same strategy.
We initially confirmed the individual associations of SNPs
rs7136702 and rs11169552 with the CRC risk in the extended
data sets (Table 1). Unconditional logistic regression analysis
did not exclude the possibility that the two SNPs had
independent effects; for rs7136702 and rs11169552, the
association statistics were P = 1.63 × 10⁻⁵ (OR = 1.07) and
P = 1.70 × 10⁻⁷ (OR = 0.92), respectively, showing that
one SNP did not simply capture all of the association
signals. Further analysis showed that the association signal
was not derived from a single high-risk haplotype tagged
by rs7136702 and rs11169552 (Supplementary Material,
Table S3) and there was no evidence of epistasis between
the SNPs (P = 0.903).

We imputed SNPs within the 48.5-50 Mb region of
chromosome 12 using the combined 1000 Genomes and
HapMap3 reference panel in the three GWAS sample sets
(UK1, Scotland1 and VQ58) (Fig. 3, Supplementary Material,
Table S4). A total of 2736 SNPs were successfully imputed
from 158 genotyped SNPs. The most significant single-SNP
association was at the imputed SNP rs7972465 [OR = 1.18,
95% confidence interval (CI) 1.11-1.27, P = 8.22 × 10⁻⁷],
a signal slightly stronger than that of rs11169552 (OR =
0.85, 95% CI 0.79-0.91, P = 1.08 × 10⁻⁵) and notably
stronger than that of rs7136702 (OR = 1.13, 95% CI 1.06-1.21,
P = 3.85 × 10⁻⁴). Direct genotyping in 91 UK control
individuals showed that imputation of rs7972465 was very good,
although not perfect (r² = 0.93).
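
One common way to quantify agreement between imputed and directly typed genotypes is the squared Pearson correlation between imputed allele dosages and genotyped allele counts. The text does not state how the reported r² was calculated, so the following Python sketch rests on that assumption; the function name and the toy data are hypothetical.

```python
def squared_correlation(imputed, genotyped):
    """Squared Pearson correlation between imputed dosages and genotyped allele counts."""
    n = len(imputed)
    if n != len(genotyped) or n < 2:
        raise ValueError("need two equal-length sequences with at least two samples")
    mean_x = sum(imputed) / n
    mean_y = sum(genotyped) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, genotyped))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in genotyped)
    if sxx == 0 or syy == 0:
        raise ValueError("r2 is undefined when either variable has no variation")
    return sxy * sxy / (sxx * syy)


# Toy data: imputed dosages (0 to 2) and directly typed allele counts for five samples.
imputed_dosage = [0.1, 0.9, 1.8, 1.2, 0.0]
typed_count = [0, 1, 2, 1, 1]
print(squared_correlation(imputed_dosage, typed_count))
```

A value of 1 indicates that the imputed dosages are a perfect linear predictor of the typed genotypes; values below 1 reflect imputation uncertainty or discordant calls.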

Reverse stepwise logistic regression analysis was then used
to assess whether rs11169552 and rs7136702, or other
combinations of SNPs in the region, best accounted for the
association between CRC and 12q13.13 variation (Table 3). Many
highly correlated SNPs exist within the region, making this
analysis difficult. Nonetheless, while rs11169552 remained
in the regression model after stepwise elimination of less
strongly associated SNPs, a number of SNPs provided
improved or similar associations compared with rs7136702
in a two-SNP model with rs11169552. One of these SNPs
was rs7972465 (Table 3, Fig. 4), but another SNP, rs706793,
a SNP in very low LD with rs11169552 (Table 3), provided
a larger improvement in the AIC (see also Supplementary
Material, Table S4).

We then undertook GENECLUSTER analysis of the UK1,
Scotland1 and VQ58 sample sets in the 12q13.13 region.
There was no good evidence to distinguish between
underlying two-locus and one-locus models (Fig. 5), although the
association signal showed two peaks at ~48.85 Mb (close to
rs706793) and at ~49.45 Mb (very close to rs11169552)
that could not readily be explained by long-range LD
between these two regions (Supplementary Material,
Fig. S1). The predicted most strongly associated SNP under
the one-SNP model was rs3184122 (Supplementary Material,
Table S4), a variant that is in moderate or strong LD
(Fig. 5) with rs11169552 (r² = 0.19, D′ = 0.92), rs7136702
(r² = 0.49, D′ = 0.73) and rs706793 (r² = 0.47, D′ = 0.94),
and strong LD with rs7972465 (r² = 0.87, D′ = 1.00).

Since the various analyses had not resolved the question of
whether there exist one or two independent CRC-associated
SNPs in the 12q13.13 region, we used PLINK to examine
the associations with disease of the haplotypes (Fig. 4) for
rs706793, rs7972465 and rs11169552. As expected, the
haplotype CGC was most strongly associated with risk (Table 4,
Supplementary Material, Table S5). The G (risk) allele at
rs7972465 was essentially present only on this haplotype,
but it appeared that haplotypes containing the T allele at
rs7972465 were not all low risk and therefore that
rs7972465 did not explain all the association signal. We
therefore considered the association signals when we fixed the
alleles at rs706793 and rs11169552 and allowed those at
rs7972465 to vary, and vice versa. Initially, we undertook
simple comparisons between haplotype frequencies in cases
and controls, and found that the rs706793 and rs11169552
Figure 2. GENECLUSTER output for the 1q41 region. The upper left panel compares the Bayes factors (BFs) for models in which the association signals at
rs6691170 and rs6687758 are derived from either one functional SNP (red) or two functional SNPs (green). Recombination rates are also shown as a red line. The
upper right panel shows the log10(BF) at the focal position (the site of the highest log10(BF), here chr1:220,129,000 bases) under one- and two-SNP models.
The lower right panel shows reconstructed genealogies for UK1, Scotland1 and VQ58 combined, based on each individual's genotypes in the region from the
Illumina Hap300/370/550 panels and HapMap2 data. The most likely positions of SNP origins under the one-SNP model (blue, rs11577023) and two-SNP model
(green and red) are shown. These result in counts of cases and controls and relative risks as indicated in the upper right panel. The lower left panel shows
haplotypes (rows) and SNPs (columns). Note that the region analysed extends for several Mb flanking rs6691170 and rs6687758; although no signal reaches nominal
significance at log10(BF) = 4, there is some evidence of a second independent region of 1q associated with CRC at ~218.2 Mb, as we have reported previously.
The importance of rs11577023 was supported by the Margarita analysis in which it was the second most strongly associated with disease (P = 3.59 × 10⁻…).
Figure 3. Individual SNP associations in the 12q13.13 region. Legend is as for Figure 1.



Table 3. Two-SNP logistic regression analysis showing best signals in the 12q13.13 region in comparison with the original reported SNPs

SNPs        Genotyped or imputed?  Positions (bases)  LD (r², D′)  Risk allele (freq cases, freq controls)  No. cases, no. controls  OR    95% CI     z     P-value      AIC
rs11169552  Genotyped              49,441,930         0.040, 0.57  C (0.76, 0.73)                           3276, 4576               0.92  0.81-0.94  3.37  7.52 × 10⁻⁴  10473
rs7136702   Genotyped              49,166,483                      T (0.37, 0.35)                                                    1.08  1.01-1.16  2.13  0.033
rs11169552  Genotyped              49,441,930         0.19, 0.92   C (0.76, 0.73)                           3206, 4480               0.89  0.82-0.97  2.69  0.007        10469
rs3184122   Imputed                48,856,394                      C (0.41, 0.37)                                                    1.12  1.04-1.21  2.96  0.003
rs11169552  Genotyped              49,441,930         0.06, 1.00   C (0.76, 0.73)                           3268, 4554               0.88  0.81-0.95  3.40  6.74 × 10⁻⁴  10468
rs35031884  Imputed                49,063,840                      A (0.32, 0.29)                                                    1.15  1.05-1.25  3.19  0.0014
rs11169552  Genotyped              49,441,930         0.17, 0.89   C (0.76, 0.73)                           3268, 4563               0.90  0.83-0.97  2.64  0.008        10466
rs7972465   Imputed                48,832,392                      G (0.21, 0.18)                                                    1.14  1.06-1.22  3.43  6.04 × 10⁻⁴
rs11169552  Genotyped              49,441,930         0.002, 0.095 C (0.76, 0.73)                           3266, 4557               0.84  0.78-0.91  4.56  5.12 × 10⁻⁶  10426
rs706793    Genotyped              48,754,036                      C (0.60, 0.57)                                                    0.49  0.32-0.76  3.23  0.0012

SNP pairs are shown in order of descending Akaike Information Criterion (AIC). Note that in single-SNP analysis, rs706793 provided only slightly worse evidence
of association (OR = 0.90, 95% CI 0.84-0.96, P = 0.002) than in combined analysis with rs11169552. Individual AICs for rs11169552, rs7136702, rs3184122,
rs35031884, rs7972465 and rs706793 were, respectively, 10476, 10483, 10475, 10477, 10471 and 10447. Incorporation of rs7972465 into a regression model with
rs11169552 and rs706793 did not improve the model's fit (AIC = 10426).



risk alleles, but not the rs7972465 risk allele, were found at
significantly higher frequencies in cases than controls
(Supplementary Material, Table S6). Since this analysis suggested that
there might be independent effects of rs706793 and
rs11169552 (and that the signal at rs7972465 resulted from
LD with these two SNPs), we proceeded to a further
evaluation of this possibility using conditional haplotype analysis
in PLINK. We again compared two scenarios, (i) in which
the CGC and CTC haplotypes were equivalent (that is,
varying rs7972465) and (ii) in which the CTC and TTT
haplotypes were equivalent (that is, varying rs706793 and
rs11169552). No effect was seen in the first case (likelihood
ratio test, P = 0.35), whereas there was a significant difference
in the second case (P = 0.023), again supporting effects of
rs706793 and rs11169552 rather than rs7972465.



Further genotyping in additional sample sets strengthened the
rs706793 association with CRC, although it did not reach formal
significance and there was some evidence of inter-study
heterogeneity, the origins of which remain unclear (Supplementary
Material, Table S7). Logistic regression analysis in the extended
sample set continued to support a model incorporating rs706793
and rs11169552 (P = 8.38 × 10⁻⁴ and P = 7.82 × 10⁻⁶,
respectively, AIC = 27932) over one with rs7136702
and rs11169552 (P = 3.05 × 10⁻⁵ and P = 9.05 × 10⁻³,
AIC = 27999).

rs706793 (chr12:48,754,036) and rs11169552
(chr12:49,441,930) are separated by a predicted recombination
hotspot at ~48.8 Mb in the HapMap data (Fig. 3) but not in
our own data (Fig. 5), although LD in the region is complex
(Supplementary Material, Fig. S1). The 12q13.13 region
Figure 4. LD and main haplotypes at SNPs with best evidence of association on 12q13.13. Note that in this Haploview output from HapMap3 data, the alleles at
rs706793 are shown on the opposite strand (that is G/A rather than C/T as used in the rest of this manuscript).



contains coding genes ACCN2, SMARCD1, GPD1, LASS5,
LIMA1 and ATF1. ACCN2 probably encodes an ion channel
protein, SMARCD1 is part of the chromatin remodelling
complex SNF/SWI, GPD1 is glycerol-3-phosphate
dehydrogenase and LASS5 is probably a ceramide synthase. LIMA1
codes for EPLIN, a protein downregulated in some cancers.
ATF1 is a transcription factor centrally involved in the stress
response and in the pathogenesis of angiomatoid fibrous
histiocytoma and clear cell sarcoma through translocation.
Supplementary Material, Table S8 lists SNPs in strong LD (r² >
0.70) with rs706793, rs7972465 or rs11169552, and provides
annotation for those with evidence of potential roles in gene
or protein regulation or function.

DISCUSSION 

We have undertaken additional genotyping and more detailed
analysis in order to understand better the dual tagSNP
association signals that we observed on chromosomes 1q41
(rs6691170, rs6687758) and 12q13.13 (rs11169552,
rs7136702) in a GWAS of CRC (1). In both cases, genotyping
of additional sample series confirmed the originally reported
associations, without demonstrating good evidence for the
three simplest scenarios: independent functional variants;
capture of the association signal by one of the pair of SNPs;
or two-SNP tagging of a single haplotype on which functional
variation lay. We therefore proceeded to more detailed
analyses in each region, after imputation of genotypes where
appropriate in the data sets with best coverage of each
region (UK1, Scotland1 and VQ58). It is conceivable that
the analysis of these three data sets, which had already been
used in SNP discovery, would introduce a small amount of
bias into the fine mapping. However, we reasoned that the
marginal differences in association that might occur would
be more than outweighed by the power provided by the use
of these data sets.

For 1q41, the single-SNP association test, logistic regression
analysis and GENECLUSTER all found that SNP
rs11118883, or a SNP in strong LD, was most likely to be
responsible for the signal of association. This SNP itself is a
very good functional candidate, lying within or immediately
adjacent to regions bearing histone methylation and
acetylation marks, DNase I hypersensitive sites and sites of
transcription factor binding
(http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=195445293&c=chr1&g=wgEncodeReg).
The SCAN expression Quantitative Trait Locus (eQTL)
database (4) reports rs11118883 being associated (P = 8 × 10⁻⁵)
in Europeans with expression of platelet-derived growth
factor receptor β (PDGFRB, chr5q31-q32), although this association
requires confirmation in appropriate cell types for the CRC
risk and is not present in the Genevar eQTL database
(http://www.sanger.ac.uk/resources/software/genevar) (5).
The possibility that the minor allele of rs12726661 is
associated with a second, presumably rare, variant that is
protective against the CRC risk is intriguing. While speculative, such
a scenario has precedents, such as the MDM2 promoter SNP
rs117039649 (6).

For 12q13.13, a complex situation was found. Single-SNP
analysis found variants with much stronger association
signals than either rs11169552 or rs7136702, notably at
rs7972465, although small imputation inaccuracies may have
inflated this signal. GENECLUSTER analysis found no
greater evidence for a two-SNP than one-SNP model and
detected the best signal under the one-SNP model at a SNP,
rs3184122, that is in strong LD with rs7972465. Logistic regression
analysis, however, supported a two-SNP model, in which
a signal at rs706793 was added to that at
rs11169552. rs706793 and rs11169552 are in very weak LD,
but rs706793 is in moderate LD with rs7136702 (r² = 0.20,
D' = 0.60). Haplotype analysis supported the logistic
regression analysis, in that the genotype at rs7972465 did not
affect the risk associated with the rs706793-rs11169552
haplotypes, whereas the reverse scenario (high- versus low-risk
rs706793-rs11169552 haplotypes) did affect risk. As
regards eQTLs for the 12q13.13 SNPs, Genevar
shows rs706793 to be associated with LASS5 expression
(at P < 10⁻⁴), although this association is not reported in
SCAN.
Figure 5. GENECLUSTER output for the 12q13.13 region. The legend is as for Figure 2, except that the focal position is Chr12:48,849,000 and the double peak
of association at ~48.85 and 49.45 Mb should be noted. The top SNP (blue dot) under the one-SNP model is the imputed SNP rs3184122. The top-genotyped
SNP in the GENECLUSTER analysis was rs7138945, which was the SNP with the second-best association signal in Margarita (P = 1.14 × 10⁻⁵).

Clearly, all post-GWAS fine-mapping studies face intrinsic
difficulties, such as the use of imputed genotypes, despite the
use of stringent criteria for SNP inclusion, and a limited ability
to differentiate among association signals of similar magnitudes.
The analysis of the 12q13.13 region illustrates some
of these problems well. Although a much more strongly

Table 4. Haplotype analysis in the 12q13.13 region

Haplotype  Freq. in cases  Freq. in controls  OR    P-value
TTT        0.0922          0.1085             0.82  5.3 × 10⁻⁴
CTT        0.1407          0.1567             0.87  4.5 × 10⁻³
CGC        0.3829          0.3443             1.19  6.2 × 10⁻⁷
TTC        0.3102          0.3194             0.96  0.212
CTC        0.0739          0.0710             1.05  0.483

Haplotypes (cen-tel) at rs706793, rs7972465 and rs11169552 were analysed,
notwithstanding the low LD between the first and last of these SNPs. Five
haplotypes with frequencies of >0.01 were predicted. Ca, cases; Co, controls.
OR, odds ratio relative to all other haplotypes. P-value is from analysis of
effects of all haplotypes on disease risk in a logistic regression model. Odds
ratios and P-values relative to reference haplotype TTT are given in
Supplementary Material, Table S5.

CRC-associated SNP than the original tagSNPs was identified
through imputation, the balance of evidence slightly favours
this signal resulting from two independent association
signals, as we have previously found for the GREM1 locus
(2). In the 1q41 region, in contrast, rs11118883 (a SNP in
moderate LD with both the original tagSNPs) emerged as
an excellent candidate for the functional variant.



MATERIALS AND METHODS 

Sample sets 

The Kentucky samples comprised 1020 incident colon
cancer cases and 1598 population controls of white
European origin recruited between July 2003 and December
2009. Eligible cases were identified through the
population-based Surveillance, Epidemiology and End Results (SEER)
Kentucky Cancer Registry covering all residents living in
the State of Kentucky at the time of diagnosis. We used
random digit dialling to recruit population controls who
were 40 years of age or older and had no personal history
of cancer other than skin cancer. We excluded those with
known inflammatory bowel diseases or a family history of
familial adenomatous polyposis or hereditary
non-polyposis CRC.

The Prague cases (7) were patients with histologically con- 
firmed CRC recruited between September 2004 and February 
2009 from nine oncology departments in the Czech Republic: 
Prague (two), Benesov, Brno, Liberec, Ples, Pribram, Usti nad
Labem and Zlin. During this period, a total of 1554 cases pro- 
vided blood samples. This study includes 1001 subjects who 
could be interviewed, provided biological samples and were 
genotyped. Controls were 683 hospital-based volunteers with 
negative colonoscopy results for malignancy or idiopathic 
bowel diseases (CFCC, cancer-free colonoscopy inspected 
controls). CFCCs were selected from among individuals 
admitted to the same hospitals during the same period of the 
recruitment of the cases. The reasons for undergoing the 
colonoscopy were: (i) positive faecal occult blood test, (ii) 
haemorrhoids, (iii) abdominal pain of unknown origin, or 
(iv) macroscopic bleeding. 

Details of other sample sets have been reported previously 
(2) and are provided briefly below. 



UK1 (CORGI) comprised 922 cases with colorectal neopla- 
sia (47% male) ascertained through the Colorectal Tumour 
Gene Identification (CORGI) consortium. All had at least 
one first-degree relative affected by CRC and one or more 
of the following phenotypes: CRC at age 75 or less; any colo- 
rectal adenoma (CRAd) at age 45 or less; >3 CRAds at age 75 
or less; or a large (>1 cm diameter) or aggressive (villous 
and/or severely dysplastic) adenoma at age 75 or less. The 
929 controls (45% males, 55% females) were spouses or part- 
ners unaffected by cancer and without a personal family 
history (to second degree relative level) of colorectal
neoplasia. Known dominant polyposis syndromes, HNPCC/Lynch
syndrome or bi-allelic MUTYH mutation carriers were 
excluded. 

Scotland1 (COGS) included 980 CRC cases (51% male;
mean age at diagnosis 49.6 years, SD ± 6.1) and 1002
cancer-free population controls (51% male; mean age 51.0 years;
SD ± 5.9). Cases were selected for early age at onset (age < 55
years). Known dominant polyposis syndromes, HNPCC/
Lynch syndrome or bi-allelic MUTYH mutation carriers were
excluded. Control subjects were sampled from the Scottish
population NHS registers, matched by age (±5 years),
gender and area of residence within Scotland.

VQ58 comprised 1832 CRC cases (1099 males, mean age of 
diagnosis 62.5 years; SD ± 10.9) from the VICTOR and 
QUASAR2 (www.octo-oxford.org.uk/alltrials/trials/q2.html) 
clinical trials of adjuvant therapy in stage II/III CRC. There 
were 2720 population control genotypes (1391 males) from 
the Wellcome Trust Case-Control Consortium 2 (WTCCC2) 
1958 birth cohort (also known as the National Child Develop- 
ment Study), which included all births in England, Wales and 
Scotland during a single week in 1958. 

The Australian study comprised 591 patients treated for 
CRC at the Royal Melbourne, Western and St Francis 
Xavier Cabrini Hospitals in Melbourne from 1999 to 2009. 
The 2353 controls were derived from Queensland or 
Melbourne: for the former, the controls came from the 
Brisbane Twin Nevus Study; for the latter, individuals were 
participants in the Genes in Myopia study. There was no 
overlap between the CFR and Australian data sets. Owing to 
potential residual ethnic heterogeneity within the Melbourne 
population, for the Australian cohort only we performed an 
additional screen to minimize heterogeneity after performing 
principal components analysis (PCA) to remove individuals 
who clustered with non-CEU individuals (see below). We 
achieved this by performing PCA on the Australian cases 
and controls without reference samples of known ancestry. 
We then paired each case with a control in a 1:1 ratio based 
on a maximum separation of 0.050 using the first and 
second eigenvectors. All unpaired samples were excluded, 
leaving 441 cases and 441 controls in the study. Calculation 
of the genomic inflation factor, λGC, showed this to be 1.02
after this filtering. 
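
The 1:1 pairing of cases and controls described above can be sketched as follows. This is a minimal illustration only: the text does not state which matching algorithm was used, so the greedy closest-pair-first strategy, the use of Euclidean distance on the first two eigenvectors, the function name and the toy coordinates are all assumptions.

```python
import math

def pair_cases_controls(cases, controls, max_sep=0.050):
    """Greedily pair cases and controls, closest pairs first.

    cases, controls: dicts mapping sample ID -> (PC1, PC2).
    Returns (case_id, control_id) pairs whose Euclidean separation
    on the two eigenvectors is at most max_sep (assumed metric).
    """
    # Enumerate every candidate pair within the separation limit.
    candidates = []
    for ca, (x1, y1) in cases.items():
        for co, (x2, y2) in controls.items():
            d = math.hypot(x1 - x2, y1 - y2)
            if d <= max_sep:
                candidates.append((d, ca, co))
    candidates.sort()  # closest pairs are considered first

    used_cases, used_controls, pairs = set(), set(), []
    for _, ca, co in candidates:
        if ca not in used_cases and co not in used_controls:
            pairs.append((ca, co))
            used_cases.add(ca)
            used_controls.add(co)
    return pairs  # samples absent from `pairs` would be excluded

# Hypothetical eigenvector coordinates, for illustration only.
toy_cases = {"case1": (0.010, 0.020), "case2": (0.300, 0.300)}
toy_controls = {"ctrl1": (0.015, 0.022), "ctrl2": (0.100, 0.100)}
pairs = pair_cases_controls(toy_cases, toy_controls)
print(pairs)
```

In this toy example only case1 and ctrl1 lie within 0.050 of each other, so the remaining two samples would be dropped as unpaired.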

UK2 (NSCCG) consisted of 2854 CRC cases (58% male, 
mean age at diagnosis 59.3 years; SD ± 8.7) ascertained
through two ongoing initiatives at the Institute of Cancer 
Research/Royal Marsden Hospital NHS Trust (RMHNHST) 
from 1999 onwards — The National Study of Colorectal 
Cancer Genetics (NSCCG) and the Royal Marsden Hospital 
Trust/Institute of Cancer Research Family History and DNA
Registry. The 2822 controls (41% males; mean age 59.8
years; SD ± 10.8) were the spouses or unrelated friends of
patients with malignancies. None had a personal history of 
malignancy at the time of ascertainment. All cases and con- 
trols had self-reported European ancestry, and there were no 
obvious differences in the demography of cases and controls 
in terms of place of residence within the UK. 

Scotland2 (SOCCS) comprised 2024 CRC cases (61% male;
mean age at diagnosis 65.8 years, SD ± 8.4) and 2092
population controls (60% males; mean age 67.9 years, SD ± 9.0)
ascertained in Scotland. Cases were taken from an independent,
prospective, incident CRC case series and aged <80
years at diagnosis. Control subjects were population controls
matched by age (±5 years), gender and area of residence
within Scotland.

UK3 (NSCCG) comprised 7912 CRC cases (65% male; 
mean age at diagnosis 59 years, SD ± 8.2) and 4398 controls
(40% male; mean age 62 years, SD ± 11.5) ascertained
through NSCCG post-2005. 

Scotland3 (SOCCS) comprised 1145 CRC cases (50% male;
mean age at diagnosis 53.2 years, SD ± 15.4) and 2203
cancer-free population controls (47% male; mean age 51.8
years, SD ± 11.5). Controls were recruited as part of the
Generation Scotland study. 

UK4 (CORGI2BCD) consisted of 621 CRC or CRAd cases 
(46% male; mean age at diagnosis 58.3 years; SD ± 14.1) and
1121 cancer-free population or spouse controls (45% male;
mean age 45.1 years, SD ± 15.9), sampled using the same
criteria as UK1. 

Cambridge/SEARCH consisted of 2248 CRC cases (56%
male; mean age at diagnosis 59.2 years, SD ± 8.1) and 2209
controls (42% males; mean age 57.6 years, SD ± 15.1).
Samples were ascertained through the SEARCH (Studies of
Epidemiology and Risk Factors in Cancer Heredity,
http://www.cancerhelp.org.uk/trials/a-study-looking-at-genetic-causes-of-cancer)
study based in Cambridge, UK. Recruitment started
in 2000; initial patient contact was through the general
practitioner. Control samples were collected post-2003. Eligible
individuals were sex and frequency matched in 5-year age
bands to cases.

The COIN samples were 2151 cases derived from the 
COIN and COIN-B clinical trials of metastatic CRC. 
Median age was 63 years. COIN cases were compared 
against genotypes from 2501 population controls (1237 
males), from the WTCCC2 National Blood Service (NBS) 
cohort (50% male; mean age at diagnosis 53.2 years, 
SD ± 15.4).

The Helsinki (FCCPS) study
(http://research.med.helsinki.fi/gsb/aaltonen/) comprised 988
cases from a population-based collection centred on
south-eastern Finland and 864 population controls from the
same collection.

EPICOLON included 1410 CRC cases matched with the 
same number of controls collected in a prospective fashion 
from centres in Spain. Exclusion criteria were Mendelian 
CRC syndromes and a personal history of inflammatory 
bowel disease. 

The Leiden sample set included 858 unselected cases with 
CRC and 690 controls ascertained through genetic testing pro- 
grammes for non-cancer-related conditions from the Leiden 
area. 



In all cases, CRC was defined according to the ninth revi- 
sion of the International Classification of Diseases (ICD) by 
codes 153-154 and all cases had pathologically proven 
disease. Only individuals of white European origin were 
included in the study. 

Sample preparation and genotyping 

Collection of blood samples and clinico-pathological informa- 
tion from patients and controls was undertaken with informed 
consent and ethical review board approval in accordance with 
the tenets of the Declaration of Helsinki. DNA was extracted 
from samples using conventional methods and quantified 
using PicoGreen (Invitrogen). The VQ, UK1, Scotland1 and
Australia GWA cohorts were genotyped using Illumina 
Hap300, Hap370 or Hap550 arrays. 1958BC and NBS geno- 
typing was performed as part of the WTCCC2 study on 
Hap1.2M arrays. In UK2 and Scotland2, genotyping was
conducted using custom Illumina Infinium arrays according to the
manufacturer's protocols. Some COIN SNPs were typed on 
custom Illumina Goldengate arrays. To ensure quality of geno- 
typing, a series of duplicate samples was genotyped, resulting 
in 99.9% concordant calls in all cases. 

Other genotyping was conducted using competitive 
allele-specific PCR KASPar chemistry (KBiosciences Ltd, 
Hertfordshire, UK), Taqman (Life Sciences, Carlsbad, CA, 
USA) or MassARRAY (Sequenom Inc., San Diego, CA, USA). 
All primers, probes and conditions used are available on 
request. Genotyping quality control was tested using duplicate 
DNA samples within studies and SNP assays, together with 
direct sequencing of subsets of samples to confirm genotyping 
accuracy. For all SNPs, > 99% concordant results were obtained. 

We excluded SNPs from analysis if they failed one or more
of the following thresholds: GenCall scores <0.25; overall
call rates <95%; minor allele frequency (MAF) <0.01;
departure from the Hardy-Weinberg equilibrium (HWE) in
controls at P < 10⁻⁴ or in cases at P < 10⁻⁶; outlying in
terms of signal intensity or X:Y ratio; discordance between
duplicate samples; and, for SNPs with evidence of association,
poor clustering on inspection of X:Y plots. We excluded
individuals from the GWA analyses if they had evidence of
non-white European ancestry by PCA-based analysis in
comparison with HapMap samples (http://hapmap.ncbi.nlm.nih.gov/)
or by self-report. Deviation of the genotype frequencies
in the controls from those expected under HWE was assessed
by the χ² test (1 df), or Fisher's exact test where an expected
cell count was <5.
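
As an illustration of the HWE check and the numeric SNP thresholds listed above, the following sketch computes the 1-df χ² statistic from genotype counts and applies the call rate, MAF and control HWE cut-offs. It is not the pipeline used in the study: the function names are hypothetical, and the exact-test fallback for expected cell counts <5 and the intensity-based filters are deliberately omitted.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """1-df chi-square test for Hardy-Weinberg equilibrium.

    Arguments are observed genotype counts. Returns (chi2, p).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-square distribution with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(call_rate, maf, hwe_p_controls,
              min_call_rate=0.95, min_maf=0.01, hwe_threshold=1e-4):
    """Apply the call rate, MAF and control-HWE thresholds from the text."""
    return (call_rate >= min_call_rate
            and maf >= min_maf
            and hwe_p_controls >= hwe_threshold)

# Hypothetical control genotype counts, for illustration only.
chi2, p_value = hwe_chi2(25, 50, 25)
print(chi2, p_value, passes_qc(0.99, 0.5, p_value))
```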

Association statistics and imputation 

Associations between SNP genotype and disease status were
primarily assessed in STATA v10 (http://www.stata.com/)
and PLINK v1.07 (http://pngu.mgh.harvard.edu/~purcell/plink/)
using allelic and Cochran-Armitage tests (both with
1 df) respectively, or by Fisher's exact test where an expected
cell count was <5. Genotypic (2 df), dominant (1 df) and
recessive (1 df) tests were also performed. The risks associated
with each SNP were estimated by allelic, heterozygous and
homozygous ORs using unconditional logistic regression,
and associated 95% CIs were calculated.
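
For readers who wish to reproduce the form of these statistics, the sketch below computes an allelic OR with a Woolf-type 95% CI and a Cochran-Armitage trend test with additive scores (0, 1, 2). It is an independent illustration rather than the STATA or PLINK implementation; the function names and the counts are hypothetical.

```python
import math

def allelic_or(case_alleles, control_alleles):
    """Allelic odds ratio with 95% CI.

    Each argument is (risk_allele_count, other_allele_count).
    """
    a, b = case_alleles
    c, d = control_alleles
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, lower, upper

def armitage_trend(case_genos, control_genos, scores=(0, 1, 2)):
    """Cochran-Armitage trend test on 1 df. Returns (chi2, p).

    Genotype counts are ordered by number of risk alleles.
    """
    n_i = [ca + co for ca, co in zip(case_genos, control_genos)]
    big_n = sum(n_i)
    big_r = sum(case_genos)
    sum_rt = sum(r * t for r, t in zip(case_genos, scores))
    sum_nt = sum(n * t for n, t in zip(n_i, scores))
    sum_nt2 = sum(n * t * t for n, t in zip(n_i, scores))
    num = big_n * (big_n * sum_rt - big_r * sum_nt) ** 2
    den = big_r * (big_n - big_r) * (big_n * sum_nt2 - sum_nt ** 2)
    chi2 = num / den
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical counts, for illustration only.
print(allelic_or((120, 80), (100, 100)))
print(armitage_trend((20, 50, 30), (30, 50, 20)))
```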
Joint analysis of data generated from multiple phases was
conducted using standard methods for combining raw data
based on the Mantel-Haenszel method in STATA and in
PLINK. Joint ORs and 95% CIs were calculated assuming
fixed- and random-effects models. Tests of the significance
of the pooled effect sizes were calculated using a standard
normal distribution. Cochran's Q statistic to test for
heterogeneity and the I² statistic to quantify the proportion of the total
variation due to heterogeneity were calculated. Large
heterogeneity is typically defined as I² > 75%. Where significant
heterogeneity was identified, results from the random-effects
model were reported. Alongside, we also performed
meta-analysis based on allele dosage (0, 1, 2) and incorporated
age and sex as co-variates. Although age and sex are
associated with the CRC risk, they were not associated with
SNP genotype and did not materially affect the significance
of any of the reported associations (data not shown).
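
The heterogeneity statistics above can be illustrated with a short sketch. Note that it pools per-study log ORs by inverse-variance weighting, with a DerSimonian-Laird random-effects estimate, rather than by the Mantel-Haenszel method on raw counts used in the study; this substitution, the function name and the input values are assumptions made for brevity.

```python
import math

def meta_analyse(log_ors, ses):
    """Inverse-variance meta-analysis of per-study log odds ratios.

    Returns fixed- and random-effects pooled estimates, Cochran's Q,
    I-squared (as a percentage) and the fixed-effects two-sided P.
    """
    w = [1.0 / se ** 2 for se in ses]
    pooled = sum(wi * b for wi, b in zip(w, log_ors)) / sum(w)
    se_pooled = math.sqrt(1.0 / sum(w))
    q = sum(wi * (b - pooled) ** 2 for wi, b in zip(w, log_ors))
    df = len(log_ors) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # DerSimonian-Laird estimate of between-study variance.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled_re = sum(wi * b for wi, b in zip(w_re, log_ors)) / sum(w_re)
    z = pooled / se_pooled
    p = math.erfc(abs(z) / math.sqrt(2.0))  # standard normal, two-sided
    return {"pooled": pooled, "se": se_pooled, "q": q, "i2": i2,
            "tau2": tau2, "pooled_random": pooled_re, "p": p}

# Hypothetical per-study estimates, for illustration only.
result = meta_analyse([0.10, 0.15, 0.05], [0.05, 0.06, 0.04])
print(result["pooled"], result["i2"])
```

Under the rule stated above, the random-effects estimate would be the one reported whenever significant heterogeneity is detected.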

The combined effects of pairs or other multiples of loci
identified as possibly associated with the CRC risk were
investigated by unconditional or conditional logistic
regression analysis in PLINK and STATA to test for independent
effects of each SNP, stratifying by sample series. Logistic
regression was undertaken both pairwise with the original
tagSNP and then in a backwards analysis that initially
included all SNPs with good evidence of association in
each region. We used Haploview software v4.2
(http://www.broadinstitute.org/haploview) to infer the LD
structure of the genome in the 1q41 and 12q13.13 regions, and
used the expectation-maximization algorithms in Haploview or
PLINK to infer haplotypes.
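
The comparison of one-SNP and two-SNP logistic regression models by AIC, as reported in the Results, can be sketched in plain Python. AIC is computed as 2k - 2 ln L, where k is the number of fitted coefficients. The Newton-Raphson fitter, the helper names and the genotype counts below are illustrative assumptions, and stratification by sample series is omitted.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_logistic(rows, y, n_iter=25):
    """Newton-Raphson logistic fit; an intercept is added here.

    Returns (coefficients, maximised log-likelihood).
    """
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(xs, y):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1 - p) * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for x, yi in zip(xs, y):
        eta = sum(b * v for b, v in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        loglik += math.log(p) if yi else math.log(1 - p)
    return beta, loglik

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

# Hypothetical counts: (dosage SNP1, dosage SNP2) -> (cases, controls).
counts = {(0, 0): (30, 50), (1, 0): (40, 40),
          (0, 1): (25, 30), (1, 1): (45, 25)}
rows, y = [], []
for (g1, g2), (n_case, n_ctrl) in counts.items():
    rows += [[g1, g2]] * (n_case + n_ctrl)
    y += [1] * n_case + [0] * n_ctrl

beta_one, ll_one = fit_logistic([[r[0]] for r in rows], y)
beta_two, ll_two = fit_logistic(rows, y)
print(aic(ll_one, len(beta_one)), aic(ll_two, len(beta_two)))
```

The model with the lower AIC is preferred; adding a SNP is only favoured when the gain in log-likelihood outweighs the penalty of 2 for the extra coefficient.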

To predict genotypes at untyped SNPs in both regions,
imputation of the UK1, Scotland1 and VQ58 data sets was
performed using the IMPUTE2 software and the combined
CEU 1000 Genomes low-coverage pilot and complete
HapMap3 haplotypes reference set, which was filtered to
remove duplicate haplotypes (both from
https://mathgen.stats.ox.ac.uk/impute/impute_v2.html) (8). Association statistics
for imputed SNPs were calculated in SNPTEST v1.1.5
(www.stats.ox.ac.uk/~marchini/software/gwas/snptest.html)
using the '-proper' option, which is an additive model score
test based on missing data likelihood, to allow for the
uncertainty of imputed genotypes (9). Imputed markers with
proper_info scores <0.5, imputed call rates per SNP <0.9
(using a maximum genotype probability threshold of 0.9 to
call a genotype) and MAFs <0.01 were excluded from the
analyses. Meta-analyses of the sample sets were carried out
with Meta (10) (http://www.stats.ox.ac.uk/~jsliu/meta.html)
and in STATA, using the genotype probabilities from
IMPUTE2 where a SNP was not directly typed.
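
The exclusion rules for imputed markers can be expressed compactly, as in the sketch below. It takes the proper_info score as an input rather than recomputing it, and it assumes that the MAF is derived from expected allele dosages; the text does not specify this, and the function names and probabilities are hypothetical.

```python
def summarise_imputed_snp(probs, call_threshold=0.9):
    """Summarise one imputed SNP.

    probs: list of (P(AA), P(AB), P(BB)) per individual.
    Returns (call_rate, maf), where a genotype counts as called when
    its maximum probability reaches call_threshold.
    """
    called = sum(1 for p in probs if max(p) >= call_threshold)
    call_rate = called / len(probs)
    # Expected B-allele dosage (assumed basis for the MAF).
    dosage = sum(p[1] + 2 * p[2] for p in probs)
    freq_b = dosage / (2 * len(probs))
    return call_rate, min(freq_b, 1 - freq_b)

def keep_marker(info, call_rate, maf,
                min_info=0.5, min_call_rate=0.9, min_maf=0.01):
    """True if the marker passes all three thresholds from the text."""
    return (info >= min_info
            and call_rate >= min_call_rate
            and maf >= min_maf)

# Hypothetical genotype probabilities, for illustration only.
toy_probs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
             (0.5, 0.5, 0.0), (0.0, 0.0, 1.0)]
call_rate, maf = summarise_imputed_snp(toy_probs)
print(call_rate, maf, keep_marker(0.8, call_rate, maf))
```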

The GENECLUSTER program (11) was used to analyse our
UK1, Scotland1 and VQ58 samples specifically in order to test
whether one- or two-SNP models better fitted the association
signals in each SNP region. GENECLUSTER is a Bayesian
method that uses HapMap haplotypes to estimate the genealogy
of samples. By jointly testing all SNPs on each branch of
the genealogy in cases and controls, the program indicates
the identities of the SNP(s) most likely to have the strongest
association signal, thus potentially helping to identify
functional variation. The default model parameters were
used, specifically mutation model prior: (0.50, 0.50, 0.00),
max number of trees to consider per location: 1 and beta
risk prior parameters: (5.00, 5.00).

Essentially for comparative purposes, we also ran the
Margarita program (12), based on ancestral recombination graphs
(ARGs), in UK1, Scotland1 and VQ58 for the genotyped SNPs
in the 1q and 12q regions. This program aims to maximize
available information as to the location and identity of a
functional SNP by reconstructing the genealogical history of the
sample population. For each ARG, a putative risk mutation
is placed on the marginal tree and the frequency of each
branch in cases and controls is assessed. For each region,
30 ARGs were constructed and the significance of a SNP at
each branchpoint assessed by 10 000 permutations. Unlike
GENECLUSTER, Margarita does not specifically address
the issue of whether there are two independent underlying
SNPs in each region, and comparison was therefore restricted
to the single-SNP scenario.
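
The permutation principle used to assess significance can be illustrated with a much-simplified sketch. It permutes case-control labels for a single binary indicator (for example, whether a haplotype lies beneath a given branch) and is not a reimplementation of Margarita, which works across inferred ARGs; the function name, the test statistic and the data are assumptions.

```python
import random

def permutation_p(values, labels, n_perm=10000, seed=1):
    """Empirical P for a case-control difference in the mean of `values`.

    labels: 1 for cases, 0 for controls. Labels are shuffled n_perm
    times and the absolute difference in means is recomputed.
    """
    rng = random.Random(seed)

    def stat(lab):
        cases = [v for v, l in zip(values, lab) if l == 1]
        ctrls = [v for v, l in zip(values, lab) if l == 0]
        return abs(sum(cases) / len(cases) - sum(ctrls) / len(ctrls))

    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 40 of 50 cases and 10 of 50 controls carry the branch.
toy_values = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40
toy_labels = [1] * 50 + [0] * 50
p_emp = permutation_p(toy_values, toy_labels, n_perm=2000)
print(p_emp)
```

With 10 000 permutations, as used in the study, the smallest attainable empirical P-value under this add-one convention is approximately 10⁻⁴.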

Genome co-ordinates were taken from the NCBI build
36/hg18 (dbSNP b126).

SUPPLEMENTARY MATERIAL 

Supplementary Material is available at HMG online.